--- layout: post title: "Implementing Machine Learning Models for Heart Disease Prediction" subtitle: "Classifiers for categorical data" date: 2022-08-19 05:27:00 -0400 tags: jupyter_notebook machine_learning classifier data_science grid_search optimization featured background: '/img/posts/hpbg.jpg' ---
sklearn was used to to generate and assess 10 different machine learning classifiers.
In addition to this, the pandas_profiling module allowed me to expedite the exploratory data analysis component of the project while the modules eli5, pdpbox and shap allowed me to look deeper into the importance and contributions of the model features.
Links to module docs are found below:
#The usual modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob
#For exploratory data analysis
import pandas_profiling as pp
#For feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#For data splitting
from sklearn.model_selection import train_test_split
# Get ML classifiers
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
#Get model metrics for model evaluation
from sklearn.metrics import confusion_matrix, mean_squared_error, r2_score, accuracy_score
from sklearn.metrics import classification_report, roc_curve, auc
from sklearn.model_selection import cross_val_predict
#For confusion matrix display
import scikitplot as skplt
#For parameter grid search optimization
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
#For simplified understanding of the importance model features
import eli5
from eli5.sklearn import PermutationImportance
#For generating partial dependence plots
from pdpbox import pdp, info_plots
#For calculation of shap values to further assess contributions of model features
import shap
read_csv function.
df = pd.read_csv('heart.csv')
df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 1024 | 54 | 1 | 0 | 120 | 188 | 0 | 1 | 113 | 0 | 1.4 | 1 | 1 | 3 | 0 |
1025 rows × 14 columns
df.dtypes
age int64 sex int64 cp int64 trestbps int64 chol int64 fbs int64 restecg int64 thalach int64 exang int64 oldpeak float64 slope int64 ca int64 thal int64 target int64 dtype: object
The data has 14 attributes and 1025 observations. The attributes are:
The thal attribute corresponds to an assesment of thalassemia. The labels provided initially don't correspond to what is on the table. After a bit of research though, I found that the labels above are mapped to the following numbers on the table:
pandas_profiling extends the capabilities of the basic .describe function which allow a more extensive and quick exploratory data analysis to be carried out. This package provides the following information for each column:
pp.ProfileReport(df)
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Here is a summary of the most important takeaways from the generated report that I could see:
thalach and ca attributesthal attribute.
3.The most frequent chest pain type is typical angina (48.5%) followed by non-anginal pain (27.7%). It is highly correlated to the exang and target attributes.exang, target and age attributes. Next, we can look at the interaction plots for the interval variables (i.e., the bivariate distributions). I'll describe what some of the plots are showing:
Age and oldpeak have strong interactions in the 40-60 year old range when old peak is 0.
Age and resting blood pressure have strong interactions in the 50-60 year old range when the blood pressure is approximately 130 mmHg.
Age and cholesterol have strong interactions in the 50-60 year old range when the cholesterol levels are between 200-250 mg/dL
Age and maximum heart rate achieved have strong interactions for 60 year olds when the maximum heart rate is between 140 and 160 BPM.
Resting blood pressure and oldpeak have strong interaction at 130mmHg and oldpeak of 0
Resting blood pressure and cholesterol have strong interactions between 120-150mmHg and 180-300mg/dL cholester levels
Resting blood pressure and maximum heart rate achieved have strong interactions between 110-150mmHg and 140-180BPM
Cholesterol and resting blood pressure have strong interactions in the 180-200mg/dL range and 120-150mmHg
Cholesterol and maximum heart rate achieved have strong interactions between 200-300mg/dL and 140-180BPM
I'm going to quickly rename the column labels for the dataframe because to make interpretation of results easier moving forward.
df.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol',
'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved','exercise_induced_angina',
'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'diagnosis']
df
| age | sex | chest_pain_type | resting_blood_pressure | cholesterol | fasting_blood_sugar | rest_ecg | max_heart_rate_achieved | exercise_induced_angina | st_depression | st_slope | num_major_vessels | thalassemia | diagnosis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 1024 | 54 | 1 | 0 | 120 | 188 | 0 | 1 | 113 | 0 | 1.4 | 1 | 1 | 3 | 0 |
1025 rows × 14 columns
I'm also going to change the values of the categorical variables to things that are more readily interpretable to me since it was a bit annoying to keep reminding myself of the definitions of these different attributes.
df.loc[df['sex'] == 0, 'sex'] = 'female'
df.loc[df['sex'] == 1, 'sex'] = 'male'
df.loc[df['chest_pain_type'] == 0, 'chest_pain_type'] = 'typical angina'
df.loc[df['chest_pain_type'] == 1, 'chest_pain_type'] = 'atypical angina'
df.loc[df['chest_pain_type'] == 2, 'chest_pain_type'] = 'non-anginal pain'
df.loc[df['chest_pain_type'] == 3, 'chest_pain_type'] = 'asymptomatic'
df.loc[df['fasting_blood_sugar'] == 0, 'fasting_blood_sugar'] = 'lower than 120mg/ml'
df.loc[df['fasting_blood_sugar'] == 1, 'fasting_blood_sugar'] = 'greater than 120mg/ml'
df.loc[df['rest_ecg'] == 0, 'rest_ecg'] = 'normal'
df.loc[df['rest_ecg'] == 1, 'rest_ecg'] = 'ST-T wave abnormality'
df.loc[df['rest_ecg'] == 2, 'rest_ecg'] = 'left ventricular hypertrophy'
df.loc[df['exercise_induced_angina'] == 0, 'exercise_induced_angina'] = 'no'
df.loc[df['exercise_induced_angina'] == 1, 'exercise_induced_angina'] = 'yes'
df.loc[df['st_slope'] == 0, 'st_slope'] = 'upslope'
df.loc[df['st_slope'] == 1, 'st_slope'] = 'flat'
df.loc[df['st_slope'] == 2, 'st_slope'] = 'downslope'
df.loc[df['thalassemia'] == 1, 'thalassemia'] = 'fixed defect'
df.loc[df['thalassemia'] == 2, 'thalassemia'] = 'normal'
df.loc[df['thalassemia'] == 3, 'thalassemia'] = 'reversable defect'
df
| age | sex | chest_pain_type | resting_blood_pressure | cholesterol | fasting_blood_sugar | rest_ecg | max_heart_rate_achieved | exercise_induced_angina | st_depression | st_slope | num_major_vessels | thalassemia | diagnosis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | male | typical angina | 125 | 212 | lower than 120mg/ml | ST-T wave abnormality | 168 | no | 1.0 | downslope | 2 | reversable defect | 0 |
| 1 | 53 | male | typical angina | 140 | 203 | greater than 120mg/ml | normal | 155 | yes | 3.1 | upslope | 0 | reversable defect | 0 |
| 2 | 70 | male | typical angina | 145 | 174 | lower than 120mg/ml | ST-T wave abnormality | 125 | yes | 2.6 | upslope | 0 | reversable defect | 0 |
| 3 | 61 | male | typical angina | 148 | 203 | lower than 120mg/ml | ST-T wave abnormality | 161 | no | 0.0 | downslope | 1 | reversable defect | 0 |
| 4 | 62 | female | typical angina | 138 | 294 | greater than 120mg/ml | ST-T wave abnormality | 106 | no | 1.9 | flat | 3 | normal | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | male | atypical angina | 140 | 221 | lower than 120mg/ml | ST-T wave abnormality | 164 | yes | 0.0 | downslope | 0 | normal | 1 |
| 1021 | 60 | male | typical angina | 125 | 258 | lower than 120mg/ml | normal | 141 | yes | 2.8 | flat | 1 | reversable defect | 0 |
| 1022 | 47 | male | typical angina | 110 | 275 | lower than 120mg/ml | normal | 118 | yes | 1.0 | flat | 1 | normal | 0 |
| 1023 | 50 | female | typical angina | 110 | 254 | lower than 120mg/ml | normal | 159 | no | 0.0 | downslope | 0 | normal | 1 |
| 1024 | 54 | male | typical angina | 120 | 188 | lower than 120mg/ml | ST-T wave abnormality | 113 | no | 1.4 | flat | 1 | reversable defect | 0 |
1025 rows × 14 columns
Much better! Let me make sure that the they remain as categorical variables.
df.dtypes
age int64 sex object chest_pain_type object resting_blood_pressure int64 cholesterol int64 fasting_blood_sugar object rest_ecg object max_heart_rate_achieved int64 exercise_induced_angina object st_depression float64 st_slope object num_major_vessels int64 thalassemia object diagnosis int64 dtype: object
Finally, I'll generate dummy variables for the categorical variables in order to be able to use them in the feature selection process.
df = pd.get_dummies(df, drop_first=True)
df
| age | resting_blood_pressure | cholesterol | max_heart_rate_achieved | st_depression | num_major_vessels | diagnosis | sex_male | chest_pain_type_atypical angina | chest_pain_type_non-anginal pain | chest_pain_type_typical angina | fasting_blood_sugar_lower than 120mg/ml | rest_ecg_left ventricular hypertrophy | rest_ecg_normal | exercise_induced_angina_yes | st_slope_flat | st_slope_upslope | thalassemia_fixed defect | thalassemia_normal | thalassemia_reversable defect | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 125 | 212 | 168 | 1.0 | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 53 | 140 | 203 | 155 | 3.1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 70 | 145 | 174 | 125 | 2.6 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 3 | 61 | 148 | 203 | 161 | 0.0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 62 | 138 | 294 | 106 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 140 | 221 | 164 | 0.0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1021 | 60 | 125 | 258 | 141 | 2.8 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1022 | 47 | 110 | 275 | 118 | 1.0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1023 | 50 | 110 | 254 | 159 | 0.0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1024 | 54 | 120 | 188 | 113 | 1.4 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1025 rows × 20 columns
Finally, I'll move the diagnosis column to the end in preparation for the next stage of the process.
new_cols = [col for col in df.columns if col != 'diagnosis'] + ['diagnosis']
df = df[new_cols]
df
| age | resting_blood_pressure | cholesterol | max_heart_rate_achieved | st_depression | num_major_vessels | sex_male | chest_pain_type_atypical angina | chest_pain_type_non-anginal pain | chest_pain_type_typical angina | fasting_blood_sugar_lower than 120mg/ml | rest_ecg_left ventricular hypertrophy | rest_ecg_normal | exercise_induced_angina_yes | st_slope_flat | st_slope_upslope | thalassemia_fixed defect | thalassemia_normal | thalassemia_reversable defect | diagnosis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 125 | 212 | 168 | 1.0 | 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 53 | 140 | 203 | 155 | 3.1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 70 | 145 | 174 | 125 | 2.6 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 61 | 148 | 203 | 161 | 0.0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 62 | 138 | 294 | 106 | 1.9 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 140 | 221 | 164 | 0.0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1021 | 60 | 125 | 258 | 141 | 2.8 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1022 | 47 | 110 | 275 | 118 | 1.0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1023 | 50 | 110 | 254 | 159 | 0.0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1024 | 54 | 120 | 188 | 113 | 1.4 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1025 rows × 20 columns
sns.pairplot(df, hue='diagnosis')
<seaborn.axisgrid.PairGrid at 0x2adbc766608>
Perfect! Now we are ready to move forward.
We can also use feature_selection from sklearn to give us the chi squared statistic value. The null hypothesis for chi2 test is that two categorical variables are independent. Therefore, a higher value of the chi2 statistic means that two categorical variables are dependent and are thus more useful for classification. In other words, the highest chi squared values will represent the strongest relationships between our dependent variable (the diagnosis) and our independent variable.
data = df.copy()
X = data.iloc[:,0:19] #independent variable columns
y = data.iloc[:,-1] #dependent variable column
#Apply SelectKBest to extract best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#Concatenate two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Attributes','chi squared score'] #naming the resulting dataframe columns
print(featureScores.nlargest(19,'chi squared score')) #print 10 best features
Attributes chi squared score 3 max_heart_rate_achieved 650.008493 4 st_depression 253.653461 5 num_major_vessels 210.625919 9 chest_pain_type_typical angina 142.563300 18 thalassemia_reversable defect 141.524151 13 exercise_induced_angina_yes 130.470927 17 thalassemia_normal 129.833983 2 cholesterol 110.723364 0 age 81.425368 8 chest_pain_type_non-anginal pain 75.643418 14 st_slope_flat 66.295938 7 chest_pain_type_atypical angina 55.917533 1 resting_blood_pressure 45.974069 6 sex_male 24.373650 12 rest_ecg_normal 13.568869 16 thalassemia_fixed defect 8.772019 11 rest_ecg_left ventricular hypertrophy 5.888640 15 st_slope_upslope 5.381752 10 fasting_blood_sugar_lower than 120mg/ml 0.259249
The results of our feature selection indicate that the max_heart_rate_achieved is the best feature to use. In general, the less features our model has, the easier the interpretation of the result. I'll choose the the top 3 features based on their chi squared score to use as my independent variables for now.
Let's start by defining our independent variables which we can do by keeping only the 3 features we'll be using as our independent variables based on the feature selection results we obtained above.
cols_to_keep = ['max_heart_rate_achieved', 'st_depression', 'num_major_vessels']
x = df[cols_to_keep]
x
| max_heart_rate_achieved | st_depression | num_major_vessels | |
|---|---|---|---|
| 0 | 168 | 1.0 | 2 |
| 1 | 155 | 3.1 | 0 |
| 2 | 125 | 2.6 | 0 |
| 3 | 161 | 0.0 | 1 |
| 4 | 106 | 1.9 | 3 |
| ... | ... | ... | ... |
| 1020 | 164 | 0.0 | 0 |
| 1021 | 141 | 2.8 | 1 |
| 1022 | 118 | 1.0 | 1 |
| 1023 | 159 | 0.0 | 0 |
| 1024 | 113 | 1.4 | 1 |
1025 rows × 3 columns
And now our dependent variable is selected as such
y = df['diagnosis']
y
0 0
1 0
2 0
3 0
4 0
..
1020 1
1021 0
1022 0
1023 1
1024 0
Name: diagnosis, Length: 1025, dtype: int64
Next, I'll start with an 80:20 split for our testing and training sets which I'll then normalize via min-max normalization.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)
#Normalize independent variable data for both sets
X_train = (X_train-np.min(X_train))/(np.max(X_train)-np.min(X_train)).values
X_test = (X_test-np.min(X_test))/(np.max(X_test)-np.min(X_test)).values
Now I'll write a little function to streamline the process of calculating, reporting and storing the accuracy store for each model. This function will report the accuracy score as well as the sensitivity and specificity for each model. Each of these quantitites will be neatly placed into a pandas dataframe for each classifier model we built. The dataframe for each model will then be concatenated into one dataframe for comparison between models.
def model_assess(model, title = "Default"):
model.fit(X_train, y_train)
preds = model.predict(X_test)
cm = confusion_matrix(y_test, preds)
sensitivity = cm[0,0]/(cm[0,0]+cm[1,0])
specificity = cm[1,1]/(cm[1,1]+cm[0,1])
#print('Accuracy', title, ':', round(accuracy_score(y_test, preds), 5), '\n')
score = round(accuracy_score(y_test, preds), 5);
rf_results = pd.DataFrame([title,score,sensitivity,specificity]).transpose()
rf_results.columns = ['Method','Training Score', 'Sensitivity', 'Specificity']
#print(classification_report(y_test,preds))
return score, rf_results
Let's run the 10 different models and see what happens!
#1. Linear Perceptron Classifier
lp = Perceptron()
score, lp_df = model_assess(lp, "Linear Perceptron Classifier")
#2. Gaussian Naive Bayes
nb = GaussianNB()
score,nb_df = model_assess(nb, "Gaussian Naive Bayes")
#3. Random Forest Classfier
rf = RandomForestClassifier(n_estimators = 20, random_state = 12, max_depth = 5)
score,rf_df = model_assess(rf, "Random Forest Classfier")
#4. Extreme Gradient Boost
xgb = XGBClassifier(learning_rate=0.01, n_estimators=25, max_depth=15,gamma=0.6,
subsample=0.52,colsample_bytree=0.6,seed=27,
reg_lambda=2, booster='dart',
colsample_bylevel=0.6, colsample_bynode=0.5)
score,xgb_df = model_assess(xgb, "Extreme Gradient Boost")
#5. K-NeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
score,knn_df = model_assess(lr, "K-NeighborsClassifier")
#6. Decision Tree Classifier
dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 0, max_depth = 6)
score,dt_df = model_assess(dt, "Decision Tree Classifier")
#7. Gradient Boosting Classifier
gbc = GradientBoostingClassifier()
score,gbc_df = model_assess(gbc, "Gradient Boosting Classifier")
#8. Gaussian Process Classifier
gp = GaussianProcessClassifier()
score,gp_df = model_assess(gp, "Gaussian Process Classifier")
#9. Support Vector Classifier
svc = SVC()
score,svc_df = model_assess(svc, "Support Vector Classifier")
#10. MLP Classifier
mlp = MLPClassifier(max_iter = 1000, learning_rate_init=0.001, solver='adam')
score,mlp_df = model_assess(mlp, "MLP Classifier")
#Compile result dataframes for each model
score_df_list = [lp_df, nb_df, rf_df, xgb_df, knn_df, dt_df, gbc_df, gp_df, svc_df, mlp_df]
score_df_results = pd.concat(score_df_list,
ignore_index=True).sort_values('Training Score',axis = 0, ascending = False)
score_df_results
| Method | Training Score | Sensitivity | Specificity | |
|---|---|---|---|---|
| 6 | Gradient Boosting Classifier | 0.83415 | 0.833333 | 0.834951 |
| 5 | Decision Tree Classifier | 0.80976 | 0.778761 | 0.847826 |
| 2 | Random Forest Classfier | 0.78537 | 0.829545 | 0.752137 |
| 3 | Extreme Gradient Boost | 0.7561 | 0.776596 | 0.738739 |
| 8 | Support Vector Classifier | 0.73659 | 0.722222 | 0.752577 |
| 9 | MLP Classifier | 0.73659 | 0.76087 | 0.716814 |
| 4 | K-NeighborsClassifier | 0.72683 | 0.75 | 0.707965 |
| 7 | Gaussian Process Classifier | 0.72195 | 0.752809 | 0.698276 |
| 1 | Gaussian Naive Bayes | 0.65366 | 0.682353 | 0.633333 |
| 0 | Linear Perceptron Classifier | 0.57561 | 1.0 | 0.542105 |
Hmm, the Gradient Boosting Classifier performed the best with a 0.83 accuracy score. Can I get improved performance via a different dataset split? Let me try a 70:30 split and then we'll see if the models improve.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
#Normalize independent variable data for both sets
X_train = (X_train-np.min(X_train))/(np.max(X_train)-np.min(X_train)).values
X_test = (X_test-np.min(X_test))/(np.max(X_test)-np.min(X_test)).values
lp = Perceptron()
score, lp_df = model_assess(lp, "Linear Perceptron Classifier")
nb = GaussianNB()
score,nb_df = model_assess(nb, "Gaussian Naive Bayes")
rf = RandomForestClassifier(n_estimators = 20, random_state = 12, max_depth = 5)
score,rf_df = model_assess(rf, "Random Forest Classfier")
xgb = XGBClassifier(learning_rate=0.01, n_estimators=25, max_depth=15,gamma=0.6,
subsample=0.52,colsample_bytree=0.6,seed=27,
reg_lambda=2, booster='dart',
colsample_bylevel=0.6, colsample_bynode=0.5)
score,xgb_df = model_assess(xgb, "Extreme Gradient Boost")
knn = KNeighborsClassifier(n_neighbors=10)
score,knn_df = model_assess(lr, "K-NeighborsClassifier")
dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 0, max_depth = 6)
score,dt_df = model_assess(dt, "Decision Tree Classifier")
gbc = GradientBoostingClassifier()
score,gbc_df = model_assess(gbc, "Gradient Boosting Classifier")
gp = GaussianProcessClassifier()
score,gp_df = model_assess(gp, "Gaussian Process Classifier")
svc = SVC()
score,svc_df = model_assess(svc, "Support Vector Classifier")
mlp = MLPClassifier(max_iter = 1000, learning_rate_init=0.001, solver='adam')
score,mlp_df = model_assess(mlp, "MLP Classifier")
score_df_list = [lp_df, nb_df, rf_df, xgb_df, knn_df, dt_df, gbc_df, gp_df, svc_df, mlp_df]
score_df_results = pd.concat(score_df_list,
ignore_index=True).sort_values('Training Score',axis = 0, ascending = False)
score_df_results
| Method | Training Score | Sensitivity | Specificity | |
|---|---|---|---|---|
| 6 | Gradient Boosting Classifier | 0.8539 | 0.860759 | 0.846667 |
| 5 | Decision Tree Classifier | 0.82792 | 0.82716 | 0.828767 |
| 2 | Random Forest Classfier | 0.82143 | 0.837662 | 0.805195 |
| 3 | Extreme Gradient Boost | 0.77922 | 0.837037 | 0.734104 |
| 8 | Support Vector Classifier | 0.76299 | 0.762195 | 0.763889 |
| 9 | MLP Classifier | 0.76299 | 0.807143 | 0.72619 |
| 7 | Gaussian Process Classifier | 0.75 | 0.797101 | 0.711765 |
| 4 | K-NeighborsClassifier | 0.74351 | 0.789855 | 0.705882 |
| 0 | Linear Perceptron Classifier | 0.70779 | 0.948052 | 0.627706 |
| 1 | Gaussian Naive Bayes | 0.69156 | 0.746154 | 0.651685 |
That split improved the models. I wonder if extending the number of independent variable would aid us? I'll include the top 5 features instead of just the top 3 as I did before.
cols_to_keep = ['max_heart_rate_achieved', 'st_depression', 'num_major_vessels',
'chest_pain_type_typical angina','thalassemia_reversable defect']
x = df[cols_to_keep]
x
| max_heart_rate_achieved | st_depression | num_major_vessels | chest_pain_type_typical angina | thalassemia_reversable defect | |
|---|---|---|---|---|---|
| 0 | 168 | 1.0 | 2 | 1 | 1 |
| 1 | 155 | 3.1 | 0 | 1 | 1 |
| 2 | 125 | 2.6 | 0 | 1 | 1 |
| 3 | 161 | 0.0 | 1 | 1 | 1 |
| 4 | 106 | 1.9 | 3 | 1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1020 | 164 | 0.0 | 0 | 0 | 0 |
| 1021 | 141 | 2.8 | 1 | 1 | 1 |
| 1022 | 118 | 1.0 | 1 | 1 | 0 |
| 1023 | 159 | 0.0 | 0 | 1 | 0 |
| 1024 | 113 | 1.4 | 1 | 1 | 1 |
1025 rows × 5 columns
y = df['diagnosis']
y
0 0
1 0
2 0
3 0
4 0
..
1020 1
1021 0
1022 0
1023 1
1024 0
Name: diagnosis, Length: 1025, dtype: int64
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
#Normalize independent variable data for both sets
X_train = (X_train-np.min(X_train))/(np.max(X_train)-np.min(X_train)).values
X_test = (X_test-np.min(X_test))/(np.max(X_test)-np.min(X_test)).values
lp = Perceptron()
score, lp_df = model_assess(lp, "Linear Perceptron Classifier")
nb = GaussianNB()
score,nb_df = model_assess(nb, "Gaussian Naive Bayes")
rf = RandomForestClassifier(n_estimators = 20, random_state = 12, max_depth = 5)
score,rf_df = model_assess(rf, "Random Forest Classfier")
xgb = XGBClassifier(learning_rate=0.01, n_estimators=25, max_depth=15,gamma=0.6,
subsample=0.52,colsample_bytree=0.6,seed=27,
reg_lambda=2, booster='dart',
colsample_bylevel=0.6, colsample_bynode=0.5)
score,xgb_df = model_assess(xgb, "Extreme Gradient Boost")
knn = KNeighborsClassifier(n_neighbors=10)
score,knn_df = model_assess(lr, "K-NeighborsClassifier")
dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 0, max_depth = 6)
score,dt_df = model_assess(dt, "Decision Tree Classifier")
gbc = GradientBoostingClassifier()
score,gbc_df = model_assess(gbc, "Gradient Boosting Classifier")
gp = GaussianProcessClassifier()
score,gp_df = model_assess(gp, "Gaussian Process Classifier")
svc = SVC()
score,svc_df = model_assess(svc, "Support Vector Classifier")
mlp = MLPClassifier(max_iter = 1000, learning_rate_init=0.001, solver='adam')
score,mlp_df = model_assess(mlp, "MLP Classifier")
score_df_list = [lp_df, nb_df, rf_df, xgb_df, knn_df, dt_df, gbc_df, gp_df, svc_df, mlp_df]
score_df_results = pd.concat(score_df_list,
ignore_index=True).sort_values('Training Score',axis = 0, ascending = False)
score_df_results
| Method | Training Score | Sensitivity | Specificity | |
|---|---|---|---|---|
| 6 | Gradient Boosting Classifier | 0.86688 | 0.90411 | 0.833333 |
| 2 | Random Forest Classfier | 0.8539 | 0.913043 | 0.805882 |
| 5 | Decision Tree Classifier | 0.8539 | 0.919118 | 0.802326 |
| 4 | K-NeighborsClassifier | 0.81818 | 0.875912 | 0.77193 |
| 7 | Gaussian Process Classifier | 0.81818 | 0.887218 | 0.765714 |
| 3 | Extreme Gradient Boost | 0.81494 | 0.892308 | 0.758427 |
| 8 | Support Vector Classifier | 0.81494 | 0.892308 | 0.758427 |
| 9 | MLP Classifier | 0.81494 | 0.859155 | 0.777108 |
| 1 | Gaussian Naive Bayes | 0.81169 | 0.858156 | 0.772455 |
| 0 | Linear Perceptron Classifier | 0.79221 | 0.913043 | 0.720207 |
Using the top 5 features improved the accuracy of the models. Out of curiosity, let's see what happens if I use all the attributes we have at our disposal for our model.
y = df['diagnosis']
x = df.drop(['diagnosis'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
#Normalize independent variable data for both sets
X_train = (X_train-np.min(X_train))/(np.max(X_train)-np.min(X_train)).values
X_test = (X_test-np.min(X_test))/(np.max(X_test)-np.min(X_test)).values
lp = Perceptron()
score, lp_df = model_assess(lp, "Linear Perceptron Classifier")
nb = GaussianNB()
score,nb_df = model_assess(nb, "Gaussian Naive Bayes")
rf = RandomForestClassifier(n_estimators = 20, random_state = 12, max_depth = 5)
score,rf_df = model_assess(rf, "Random Forest Classfier")
xgb = XGBClassifier(learning_rate=0.01, n_estimators=25, max_depth=15,gamma=0.6,
subsample=0.52,colsample_bytree=0.6,seed=27,
reg_lambda=2, booster='dart',
colsample_bylevel=0.6, colsample_bynode=0.5)
score,xgb_df = model_assess(xgb, "Extreme Gradient Boost")
knn = KNeighborsClassifier(n_neighbors=10)
score,knn_df = model_assess(knn, "K-NeighborsClassifier")
dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 0, max_depth = 6)
score,dt_df = model_assess(dt, "Decision Tree Classifier")
gbc = GradientBoostingClassifier()
score,gbc_df = model_assess(gbc, "Gradient Boosting Classifier")
gp = GaussianProcessClassifier()
score,gp_df = model_assess(gp, "Gaussian Process Classifier")
svc = SVC()
score,svc_df = model_assess(svc, "Support Vector Classifier")
mlp = MLPClassifier(max_iter = 2000, learning_rate_init = 0.0001, solver='adam')
score,mlp_df = model_assess(mlp, "MLP Classifier")
score_df_list = [lp_df, nb_df, rf_df, xgb_df, knn_df, dt_df, gbc_df, gp_df, svc_df, mlp_df]
score_df_results = pd.concat(score_df_list,
ignore_index=True).sort_values('Training Score',axis = 0, ascending = False)
score_df_results
| Method | Training Score | Sensitivity | Specificity | |
|---|---|---|---|---|
| 6 | Gradient Boosting Classifier | 0.89935 | 0.970588 | 0.843023 |
| 2 | Random Forest Classfier | 0.88636 | 0.962687 | 0.827586 |
| 5 | Decision Tree Classifier | 0.87013 | 0.96124 | 0.804469 |
| 8 | Support Vector Classifier | 0.86364 | 0.892617 | 0.836478 |
| 7 | Gaussian Process Classifier | 0.85714 | 0.885906 | 0.830189 |
| 9 | MLP Classifier | 0.8474 | 0.911765 | 0.796512 |
| 4 | K-NeighborsClassifier | 0.84091 | 0.866667 | 0.816456 |
| 3 | Extreme Gradient Boost | 0.83442 | 0.875 | 0.79878 |
| 1 | Gaussian Naive Bayes | 0.81169 | 0.868613 | 0.766082 |
| 0 | Linear Perceptron Classifier | 0.79221 | 0.841727 | 0.751479 |
The Gradient Boosting Classifier still produces the best results for all metrics (i.e., training score, sensitivity and specificity). It's clossely followed by the Random Forest and Decision Tree Classifiers. Let's focus on the GBC for a bit.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
#Normalize independent variable data for both sets
X_train = (X_train-np.min(X_train))/(np.max(X_train)-np.min(X_train)).values
X_test = (X_test-np.min(X_test))/(np.max(X_test)-np.min(X_test)).values
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
y_gbc_train_pred = gbc.predict(X_train) #Predicted y values using training set
y_gbc_test_pred = gbc.predict(X_test) #Predicted y values using test set
plt.rcParams.update({'font.size': 15})
skplt.metrics.plot_confusion_matrix(y_test, y_gbc_test_pred,
normalize=False, title = 'Confusion Matrix for GBC')
plt.show()
The confusion matrix shows that 4 instances are clasified incorrectly as benign (False negatives). Also, 27 benign cases have been classified as malignant (False positives). Can we improve this further? Let's try a grid search exploration.
y = df['diagnosis']
x = df.drop(['diagnosis'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
#Normalize independent variable data for both sets
X_train = (X_train-np.min(X_train))/(np.max(X_train)-np.min(X_train)).values
X_test = (X_test-np.min(X_test))/(np.max(X_test)-np.min(X_test)).values
gbc = GradientBoostingClassifier()
grid_values = {
'learning_rate':[0.001, 0.01, 0.1, 1, 10],
'n_estimators':[100, 200, 300, 400, 500],
'max_depth': [2, 3, 4, 5, 6]
}
grid_clf_acc = GridSearchCV(gbc, param_grid = grid_values, scoring = 'recall')
grid_clf_acc.fit(X_train, y_train)
#Predict values based on new parameters
y_pred_acc = grid_clf_acc.predict(X_test)
# New Model Evaluation metrics
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))
#Logistic Regression (Grid Search) Confusion matrix
confusion_matrix(y_test,y_pred_acc)
Accuracy Score : 0.9383116883116883 Precision Score : 0.9166666666666666 Recall Score : 0.959731543624161 F1 Score : 0.9377049180327869
array([[146, 13],
[ 6, 143]], dtype=int64)
Cool! The accuracy score increased significantly after doing our grid search! Let's see the confusion matrix
skplt.metrics.plot_confusion_matrix(y_test, y_pred_acc,
normalize=False, title = 'Confusion Matrix for GBC')
<AxesSubplot:title={'center':'Confusion Matrix for GBC'}, xlabel='Predicted label', ylabel='True label'>
The confusion matrix now shows that 6 instances are clasified incorrectly as benign (False negatives) which is 2 more than what we had before. Also, 13 benign cases have been classified as malignant (False positives) which is 14 less than before. That's quite the improvement! Below are the parameters of the optimized model.
params = gbc.get_params()
params
{'ccp_alpha': 0.0,
'criterion': 'friedman_mse',
'init': None,
'learning_rate': 0.1,
'loss': 'deviance',
'max_depth': 3,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_iter_no_change': None,
'random_state': None,
'subsample': 1.0,
'tol': 0.0001,
'validation_fraction': 0.1,
'verbose': 0,
'warm_start': False}
fpr, tpr, thresholds = roc_curve(y_test,y_pred_acc)
print("The AUC is ",auc(fpr, tpr))
sns.set_style('whitegrid')
plt.figure(figsize=(10,5))
plt.title('Reciver Operating Characterstic Curve')
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
plt.plot(fpr,tpr,label='Gradient Boosting Classifier')
plt.plot([0,1],ls='--')
plt.show()
The AUC is 0.9389852686674264
Awesome! Our model has a high degree of separability which is desired.
I'll rerun the fit using the optimized parameters from the grid search in order to instantiate the model for subsequent analysis. (For some reason the model wasn't being recognized...)
y = df['diagnosis']
x = df.drop(['diagnosis'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
#Normalize independent variable data for both sets
X_train = (X_train-np.min(X_train))/(np.max(X_train)-np.min(X_train)).values
X_test = (X_test-np.min(X_test))/(np.max(X_test)-np.min(X_test)).values
gbc = GradientBoostingClassifier(ccp_alpha = 0.0,
criterion = 'friedman_mse',
init = None,
learning_rate= 0.1,
loss= 'deviance',
max_depth= 3,
max_features= None,
max_leaf_nodes= None,
min_impurity_decrease= 0.0,
min_samples_leaf= 1,
min_samples_split= 2,
min_weight_fraction_leaf= 0.0,
n_estimators= 100,
n_iter_no_change= None,
random_state= None,
subsample= 1.0,
tol = 0.0001,
validation_fraction = 0.1,
verbose = 0,
warm_start = False)
gbc.fit(X_train, y_train)
y_gbc_train_pred = gbc.predict(X_train) #Predicted y values using training set
y_gbc_test_pred = gbc.predict(X_test) #Predicted y values using test set
perm = PermutationImportance(gbc, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())
| Weight | Feature |
|---|---|
| 0.0909 ± 0.0130 | num_major_vessels |
| 0.0714 ± 0.0232 | chest_pain_type_typical angina |
| 0.0682 ± 0.0433 | st_depression |
| 0.0351 ± 0.0132 | age |
| 0.0299 ± 0.0181 | thalassemia_normal |
| 0.0149 ± 0.0120 | sex_male |
| 0.0104 ± 0.0076 | resting_blood_pressure |
| 0.0097 ± 0.0041 | max_heart_rate_achieved |
| 0.0084 ± 0.0088 | rest_ecg_normal |
| 0.0078 ± 0.0052 | thalassemia_fixed defect |
| 0.0071 ± 0.0026 | exercise_induced_angina_yes |
| 0.0039 ± 0.0095 | thalassemia_reversable defect |
| 0.0006 ± 0.0026 | st_slope_flat |
| 0 ± 0.0000 | rest_ecg_left ventricular hypertrophy |
| 0 ± 0.0000 | fasting_blood_sugar_lower than 120mg/ml |
| 0 ± 0.0000 | chest_pain_type_atypical angina |
| 0 ± 0.0000 | st_slope_upslope |
| -0.0019 ± 0.0052 | chest_pain_type_non-anginal pain |
| -0.0097 ± 0.0109 | cholesterol |
The report shows that num_major_vessels, chest_pain_type_typical angina , st_depression, age and thalassemia_normal are the top 5 features that contribute the most to modeling the target variable. I'll look into the Partial Dependence Plots for these five variables to see what we can learn.
base_features = df.columns.values.tolist()
base_features.remove('diagnosis')
feat_name = 'num_major_vessels'
pdp_dist = pdp.pdp_isolate(model=gbc, dataset=X_test, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()
The PDP shows that as the number of major blood vessels increases, the probability of heart disease decreases. That makes sense since more available vessels means that more blood can get to the heart.
Next, let's look at the PDP for chest_pain_type_typical angina.
base_features = df.columns.values.tolist()
base_features.remove('diagnosis')
feat_name = 'chest_pain_type_typical angina'
pdp_dist = pdp.pdp_isolate(model=gbc, dataset=X_test, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()
Angina is a type of chest pain caused by reduced blood flow to the heart. There's different types of angina (stable , unstable, variant and refractory). Typical angina is common and is generally experienced during periods of physical exertion and usually goes away before long. The PDP for typical angina shows that when a patient exhibits this condition the likelihood of them having heart disease goes down which makes sense.
Next, I'll look into the st_depression PDP
base_features = df.columns.values.tolist()
base_features.remove('diagnosis')
feat_name = 'st_depression'
pdp_dist = pdp.pdp_isolate(model=gbc, dataset=X_test, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()
The st_depression also shows a general reduction in probability the higher it goes. This variable is related to the shape and direction of the electrocardiagram (ECG). More specifically, it's related to the ST segment in an ECG which is associated with the electrical activity of the heart after the right and left ventricles have contracted. Depending on the shape and direction of the ST segment, one can determine if there is a change in the blood flow to the heart which can be due to a heart attack, pericarditis, myocarditis, or a more severe type of angina mongst others.
Since the ST depression value described here corresponds to the signal induced after exercise relative to rest, then perhaps there is a way to increase our understanding of this variable via its interaction with another of the parameters in our attribute list. The st_slope values could be potentially useful so I'll generate a 2D PDP between them and the ST depression.
#Get interaction between st_depression and st_upslope
inter1 = pdp.pdp_interact(model=gbc, dataset=X_test, model_features=base_features,
features=['st_slope_upslope', 'st_depression'])
pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=['st_slope_upslope', 'st_depression'],
plot_type='contour',figsize=(10,10),
plot_params = {'contour_color': 'black',
'cmap': 'coolwarm',})
plt.show()
The color scale represents the probability of getting a diagnosis of no heart disease. The "hotter" the color the higher the probability. The probability for a positive diagnosis of heart disease is highest when the ST depression is low.
#Get interaction between st_depression and st_flat
inter1 = pdp.pdp_interact(model=gbc, dataset=X_test, model_features=base_features,
features=['st_slope_flat', 'st_depression'])
pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=['st_slope_flat', 'st_depression'],
plot_type='contour',figsize=(10,10),
plot_params = {'contour_color': 'black',
'cmap': 'coolwarm',})
plt.show()
The same thing can be seen when looking at the interaction between the ST depression and the flat_slope values. The probability for a positive diagnosis of heart disease is highest when the ST depression is low.
Let's now look at, the PDP for age below
base_features = df.columns.values.tolist()
base_features.remove('diagnosis')
feat_name = 'age'
pdp_dist = pdp.pdp_isolate(model=gbc, dataset=X_test, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()
The PDP for age shows that the probability of a diagnosis of heart disease increases with age which makes sense.
Finally, let's look at the PDP for thallassemia_normal
base_features = df.columns.values.tolist()
base_features.remove('diagnosis')
feat_name = 'thalassemia_normal'
pdp_dist = pdp.pdp_isolate(model=gbc, dataset=X_test, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()
Hmm, the model suggests that the probability of a diagnosis of heart disease increases when thalassemia has a normal value... Strange...
Let's plot them and see what happens
explainer = shap.Explainer(gbc)
shap_values = explainer.shap_values(X_test)
#Include the SHAP values for each feature in the plot
feature_names = [a + ": " + str(b) for a,b
in zip(X_test.columns, np.abs(shap_values).mean(0).round(2))]
shap.summary_plot(shap_values, X_test, plot_type="bar")
The plot above shows the absolute SHAP value which tells us how much a single feature affected the prediction.The number of major vessels, typical angina chest pain and normal thalassemia were the top 3 factors for this observation.
Let's plot the bee swarm plot to get the feature importance over the entire dataset.
shap.summary_plot(shap_values, X_test,feature_names=feature_names)
From this plot, red values indicate a positive contribution (i.e., heart disease) while blue values indicate a negative contribution (i.e., no heart disease) to the prediction. From here we can see the following:
This allows us to get the local importance of each contribution. I've also plotted a violin plot of the data distributions below.
shap.summary_plot(shap_values, X_test, plot_type='violin',feature_names=feature_names)
Now I can use force plots with the SHAP values to determine how the different attributes are contributing to their predicted diagnosis.
def HD_risk_factors_SHAP(model, patient):
explainer = shap.Explainer(model)
shap_values = explainer.shap_values(patient)
shap.initjs()
return shap.force_plot(explainer.expected_value,
shap_values,
patient,
link = 'logit')
data_for_prediction = X_test.iloc[0,:].astype(float)
HD_risk_factors_SHAP(gbc, data_for_prediction)
For this person the base probability for a diagnosis of heart disease is 53.14% and their prediction using the GBC model is 91%. This is due to them not having any major vessels show up from their fluoroscopy results, having normal thalassemia, and an upsloping ST segment. Therefore, this patient would be expected to have heart disease.
Let's look at another patient
data_for_prediction = X_test.iloc[300,:].astype(float)
HD_risk_factors_SHAP(gbc, data_for_prediction)
For this patient the base probability for a diagnosis of heart disease is 53.14% and their prediction using the GBC model is 18%. This is due to them having a major vessels show up from their fluoroscopy results, having normal thalassemia, and no typical angina chest pain. Therefore, this patient would be expected to not have heart disease.
We can generate SHAP dependence contribution plots in order to assess the relationship between a feature's values and the model's predicted outcomes.
Below I'll plot the SHAP dependence plots for each of the top 4 features. I'll also plot each of these features using some of the other model attributes to get a feel of the effect that the interaction effects have on the vertical dispersion of the SHAP values. Let's start by looking at the number of major vessels feature.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,8))
shap.dependence_plot('num_major_vessels', shap_values,
X_test, interaction_index="chest_pain_type_typical angina",
ax=axes[0, 0], show=False)
shap.dependence_plot('num_major_vessels', shap_values,
X_test, interaction_index="thalassemia_normal",
ax=axes[0, 1], show=False)
shap.dependence_plot('num_major_vessels', shap_values,
X_test, interaction_index="st_depression",
ax=axes[1, 0], show=False)
shap.dependence_plot('num_major_vessels', shap_values,
X_test, interaction_index="cholesterol",
ax=axes[1, 1], show=False)
plt.show()
The SHAP dependence plots for num_major_vessels show that SHAP value is positive when the there are no major vessels showing up in the fluoroscopy results. Otherwise, the SHAP value is negative. This means that having no vessels show up in fluoroscopy would be a strong indicator of heart disease.
Next, let's look at the typical angina dependence plots below.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,8))
shap.dependence_plot('chest_pain_type_typical angina', shap_values,
X_test, interaction_index="num_major_vessels",
ax=axes[0, 0], show=False)
shap.dependence_plot('chest_pain_type_typical angina', shap_values,
X_test, interaction_index="thalassemia_normal",
ax=axes[0, 1], show=False)
shap.dependence_plot('chest_pain_type_typical angina', shap_values,
X_test, interaction_index="st_depression",
ax=axes[1, 0], show=False)
shap.dependence_plot('chest_pain_type_typical angina', shap_values,
X_test, interaction_index="cholesterol",
ax=axes[1, 1], show=False)
plt.show()
Another noted difference between the SHAP values here. These plots suggest that not having a typical angina chest pain would be indicative of heart disease. This makes sense since there are more severe angina categories present in the model.
Next, let's look at the thalassemia_normal dependence plots.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,8))
shap.dependence_plot('thalassemia_normal', shap_values,
X_test, interaction_index="num_major_vessels",
ax=axes[0, 0], show=False)
shap.dependence_plot('thalassemia_normal', shap_values,
X_test, interaction_index="chest_pain_type_typical angina",
ax=axes[0, 1], show=False)
shap.dependence_plot('thalassemia_normal', shap_values,
X_test, interaction_index="st_depression",
ax=axes[1, 0], show=False)
shap.dependence_plot('thalassemia_normal', shap_values,
X_test, interaction_index="cholesterol",
ax=axes[1, 1], show=False)
plt.show()
Interesting, having normal thalassemia seems to be contributing towards a heart disease diagnosis.
Next, the ST depression dependence plots.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,8))
shap.dependence_plot('st_depression', shap_values,
X_test, interaction_index="num_major_vessels",
ax=axes[0, 0], show=False)
shap.dependence_plot('st_depression', shap_values,
X_test, interaction_index="chest_pain_type_typical angina",
ax=axes[0, 1], show=False)
shap.dependence_plot('st_depression', shap_values,
X_test, interaction_index="thalassemia_normal",
ax=axes[1, 0], show=False)
shap.dependence_plot('st_depression', shap_values,
X_test, interaction_index="cholesterol",
ax=axes[1, 1], show=False)
plt.show()
The bottom axis shows from left to right the trend of the ST segment going from a downslope to flat to an upslope. The SHAP values appear to be trending downwards when going from a downslope towards an upslope. This suggests that having an ST segment has a an upward slope would be indicative of heart disease.
The model results can be visualized via the force plot below for all 308 patients in our test set. The x-axis represents the patient ID while the y-axis represents the SHAP value. Each trace corresponds to one of the model parameters. Red regions correspond to a prediction of a heart disease diagnosis while blue regions correspond to regions of no heart disease diagnosis. Therefore, patients with large blue swaths in their parameter space would not be expected to have heart disease while patients with primarily red swaths would be expected to have/develop heart disease.
The plot below is interactive so play around with the different ways to visualize the results of the model :]
shap_values = explainer.shap_values(X_train.iloc[:307])
shap.force_plot(explainer.expected_value, shap_values, X_test.iloc[:307],link='logit')